Women and insurance pricing policies: a gender-based analysis with GAMLSS on two actuarial datasets

In most of the United States, insurance companies may use gender to determine car insurance rates. In addition, several studies have shown that women over the age of 25 generally pay more than men for car insurance. Then, we investigate whether the distributions of claims for women and men differ in location, scale and shape by means of the GAMLSS regression framework, using microdata provided by U.S. and Australian insurance companies, to use this evidence to support policy makers’ decisions. We also develop a parametric-bootstrap test to investigate the tail behavior of the distributions. When covariates are not considered, the distribution of claims does not appear to differ by gender. When covariates are included, the regressions provide mixed evidence for the location parameter. However, for female claimants, the spread of the distribution is lower. Our research suggests that, at least for the contexts analyzed, there is no clear statistical reason for charging higher rates to women. While providing evidence to support unisex insurance pricing policies, given the limitations represented by the use of country-specific data, this paper aims to promote further research on this topic with different datasets to corroborate our findings and draw more general conclusions.


Women and insurance pricing policies: a gender-based analysis with GAMLSS on two actuarial datasets
Giuseppe Pernagallo 1 , Antonio Punzo 2* & Benedetto Torrisi 2 In most of the United States, insurance companies may use gender to determine car insurance rates.In addition, several studies have shown that women over the age of 25 generally pay more than men for car insurance.Then, we investigate whether the distributions of claims for women and men differ in location, scale and shape by means of the GAMLSS regression framework, using microdata provided by U.S. and Australian insurance companies, to use this evidence to support policy makers' decisions.We also develop a parametric-bootstrap test to investigate the tail behavior of the distributions.When covariates are not considered, the distribution of claims does not appear to differ by gender.When covariates are included, the regressions provide mixed evidence for the location parameter.However, for female claimants, the spread of the distribution is lower.Our research suggests that, at least for the contexts analyzed, there is no clear statistical reason for charging higher rates to women.While providing evidence to support unisex insurance pricing policies, given the limitations represented by the use of country-specific data, this paper aims to promote further research on this topic with different datasets to corroborate our findings and draw more general conclusions.
The research question of this paper stems from a popular belief, common in many countries.There are numerous quips regarding female drivers, who are often depicted as less skilled drivers than men.In Italy, for example, men usually say "donne al volante, pericolo costante", which can be (approximately) translated as "women driving, peril thriving".Albeit the issue may seem frivolous, it assumes great importance from the perspective of insurers, risk analysts and policy makers.If women were indeed worse customers for insurers, gender would represent an important variable to model insurance-related data.This study aims to provide evidence to determine whether insurers are statistically justified in treating women and men differently using claims data.
We have two main research objectives.Firstly, we look for the best model for the loss distribution (a largely debated issue in literature) and we investigate whether gender makes differences in some aspects of the distribution such as, for example, location or scale.Secondly, we evaluate whether gender affects the magnitude of losses, controlling for other available covariates.In particular, we give emphasis to the largest claims (the right tail of the loss distribution), which are of relevant importance for insurance companies.
In summary, our contribution is mainly empirical in nature, but also partly methodological.Empirically, we aim to provide evidence to answer the important policy question of whether gender is a relevant variable for insurers.These results are limited by the use of available data, but have important economic value for both insurers and policy makers (see "Potential limitations" for a discussion).On the other hand, we also contribute to the methodological literature by proposing the use of many statistical models neglected in similar works and introducing a bootstrap test to test for differences between groups in the tails of their distributions.
There are several studies related to the issue.Sivak and Schoettle 1 study the representativeness of gender in six different crash scenarios.Even though the results may be influenced by different factors, the authors find that, in some scenarios, male-to-male crashes tend to be underrepresented, whereas female-to-female crashes tend to be over-represented.
A study of prominent interest for insurers was carried out by Massie et al. 2 on passenger-vehicle travel data.The authors find that elevated crash involvement rates per vehicle-mile of travel are registered for young individuals (aged 16-19) and old drivers (75 and over).Men are more likely to experience a fatal crash whereas combine the characteristics of two or more distributions, so modelling many aspects that a single distribution cannot represent.
We provide novelty to this already large stream of papers in different ways.Firstly, we fit renowned, but also less used, parametric models to important car insurance datasets.Secondly, we avoid the boundary bias issue 29,30 , that in our case means allocation of probability mass to negative values, by considering distributions with a positive support or by applying convenient transformations to distributions defined on the whole real line.We accomplish the latter task by truncation or using a log-transformation.Thirdly, while the aforementioned works are concerned with the whole amount of claims, we fit the competing models splitting the data by gender to see whether relevant differences exist.Finally, we test whether gender makes differences in all (or some of) the parameters of the model used to describe the distribution of claims, and we introduce a bootstrap-based parametric test to see whether significant statistical differences exist between the value at risk (VaR) predicted by the various fitted models for females and males.

Regression modelling
With the second research question we want to assess whether gender has an effect on the magnitude of the claims, controlling for other available covariates.However, traditional regression techniques are problematic when dealing with actuarial data.Rousseeuw et al. 31 point out that in many applications (such as insurance data), outliers have relevant effects on the estimates.Traditional ordinary least squares (OLS) regression does not satisfy the requisite of robustness, because it is sensitive to outliers.Indeed, in the OLS method the underlying distribution is Gaussian 32 whereas insurance data, as discussed in "Distribution modelling", depart severely from a Gaussian distribution.For these reasons, traditional OLS cannot be used for our purpose.Among the many alternative models that can solve these problems, quantile regression gained the favour of many analysts thanks to the fact that quantiles, such as the median, are less sensitive to outliers; moreover, quantile regression models are distribution-free 33 .However, Rigby et al. 34 note that "quantile (and expectile) regressions are less reliable in the extreme tails of the distribution because of sparsity of data points".For this reason, the authors consider an alternative procedure for modelling the tail of a distribution under a regression perspective, which is used in the present work (see "Regression results").From the point of view of an insurer, knowing the behaviour of the data in the tail of the distribution is fundamental to prevent and assess adequately the largest losses.Then, we also explore the relationship between extreme losses and gender.

Data
We worked with two important insurance datasets.The choice of these datasets descends from the need of having enough covariates and a variable for gender.It is important to note that while these are large and reliable datasets, they are country-specific and therefore our results are difficult to generalize.An in-depth discussion of this issue is provided in "Potential limitations".

The automobile bodily injury claims (AutoBi) dataset
The first dataset is freely available in the R package insuranceData and is called "Automobile Bodily Injury Claims" (AutoBi).This dataset derives from a 2002 survey conducted by the Insurance Research Council (IRC), a division of the American Institute for Chartered Property Casualty Underwriters and the Insurance Institute of America.The survey asked participating companies to report claims closed with payment during a designated 2-week period.The sample available in the package is made by 1340 bodily injury liability claims.
The variable of our interest is the claimant's total economic loss (abbreviated as Loss) in thousands of dollars from a single state.Furthermore, thanks to the variable Clmsex, i.e. the claimant's gender, we were able to subset the original data dividing the losses for men and women.The split of the data causes the loss of some observations since the claimant's sex is not available for all the reported losses.The variable Loss is also used in the regression model as dependent variable; however, for the description of the model and the included covariates we invite the reader to look at "AutoBi".This dataset is also used, among the others, by Frees 35 in his book.

The automobile claim datasets in Australia (ausprivauto0405)
The second dataset is freely available in the R package CASdatasets and is named "Automobile claim datasets in Australia".Specifically, we use the dataset ausprivauto0405, made of 67,856 observations, which represent 1-year vehicle insurance policies taken out in 2004 or 2005 in Australia.Among the available policies, 4624 have at least one claim, the rest of the data are all zeros.All the losses are expressed in Australian dollars, but for scaling purposes, we rescaled the data to work with hundreds of dollars.In this case there are no missing observations.The rescaled variable ClaimAmount is also the dependent variable for the regression model, but all the information regarding the model are provided in "ausprivauto0405".This dataset is also used, among the others, by De Jong and Heller 36 in their book.It is important to note that given the presence of many zeros, all the models considered for this dataset have been zero adjusted, which means including a probability mass at 0 37 .In this way we have two different views for the phenomenon: the first dataset is focused only on losses, whereas the second one considers also policy holders without reported losses, in this way accounting for the possibility that car accidents can be more frequent depending on driver's gender.

Methodology
As already detailed in "Literature review", we evaluate the variable of interest, namely the Loss variable (denoted by Y), from the point of view of its distribution ("Distribution modelling") and as a function of some covariates of interest Z ("Regression modelling"), giving particular attention to the Gender variable.For uniformity sake, we handle both these research objectives under a model-based paradigm which uses the very flexible family of generalized additive models for location, scale and shape (GAMLSS), proposed by Rigby and Stasinopoulos 38 to overcome some of the limitations associated with the generalized linear models (GLMs)-such as, for example, the exponential family distribution assumption for the response variable-and generalized additive models (GAMs).In the GAMLSS methodology, the systematic part of the model is expanded to allow equations not only for the mean, but also for the other parameters (scale and shape) of the distribution of the response variable.

The GAMLSS regression framework
A GAMLSS model can be expressed as where D(µ, σ , ν, τ ) is a four-parameter distribution (but it can have less or more parameters), with µ and σ usually characterizing location and scale, respectively, and with ν and τ known as shape parameters (i.e., skewness and kurtosis).We denote with i = 1, . . ., 4 the generic ith equation in the system, η i is a predictor of the parameter (one for each of the four parameters), g i (•) is a function to model the parameter of the distribution (in the empirical part of the paper we use the default functions of the commands gamlss, gamlssML, and gamlssZadj), Z i is a vector of covariates, β i is the coefficient vector, and s ij (•) is a nonparametric smoothing function applied to the covariate z ij , j = 1, . . ., J , with J denoting the number of covariates.The smoothing terms s ij (•) introduce nonlinearities in the model, and are unspecified functions estimated using a scatterplot smoother, in an iterative procedure called the local scoring algorithm 39 .
The form of D(µ, σ , ν, τ ) is general and only implies that the distribution should be in parametric form; it can be any distribution (including highly skew and kurtotic continuous and discrete distributions) and it can model heterogeneity (e.g., cases where the scale or shape of the distribution of the response variable changes with explanatory variables).All the distributions defined on (0, ∞) can be zero adjusted to [0, ∞) by including a probability mass at zero using the gamlss.infpackage 40 .The resulting new distribution can then have up to five parameters, the four parameters of the original distribution defined on (0, ∞) plus a parameter ξ 0 = p 0 = P(Y = 0) ∈ (0, 1) that represents the probability mass at 0. Computationally, the function gen.Zadj() creates a mixed (continuous-discrete) probability density function (pdf) given by where f (y|µ, σ , ν, τ ) denotes the pdf on (0, ∞).

How the research objectives of the paper are handled
Firstly, we look for the best model for the loss distribution (see "Distribution modelling" for related literature) and we investigate whether Gender makes differences in some aspects of the distribution such as, for example, location or scale.We handle this first objective by regressing all the parameters µ , σ , ν and τ of D(µ, σ , ν, τ ) on Gender, i.e. on only one covariate ( Z 1 = Z 2 = Z 3 = Z 4 = Z ) in (1).Thus, in case of differences due to gender in the loss distribution, that we can identify by looking at the significance of the coefficients β 1 , β 2 , β 3 and β 4 in (1), we have the advantage to detect the aspect(s) (location, scale and/or shape) affected by this variable.
We try several models for the loss distribution not only to have a large set of models within which to look for the best one, but also to make the evaluation of gender differences more robust with respect to a wrong model specification.Thanks to the package gamlss and its extensions 41,42 , we consider both classical distributions already defined on (0, ∞) and new distributions on (0, ∞) .These new distributions are created from those with support (−∞, ∞) , using the inverse log (i.e. the exponential) transformation through the function gen.Family() with argument type="log", and by truncation using the function gen.trun() 42 .In detail, we consider the following 30 parametric models: Box-Cox Cole and Green, Box-Cox Power Exponential, Box-Cox t, Burr, Dagum (Burr III), Exponential, Gamma, Generalized Beta type 2, Generalized Gamma, Generalized Inverse Gaussian, Generalized Pareto, Inverse Gamma, Inverse Gaussian, Log-Gumbel, Log-Johnson's SU, Log-Logistic, Log-Normal, Log-Power Exponential, Log-Skew Normal Type 2, Log-Skew t Type 5 43 , Log-t Family, Pareto Type 2, Truncated Exponential Gaussian, Truncated Johnson's SU, Truncated Logistic, Truncated Normal, Truncated Power Exponential, Truncated Skew t Type 5 43 , Truncated t Family, Weibull.
The distributions were fitted via the maximum likelihood (ML) approach.It must be noted that, for the ausprivauto0405 dataset, we did not implement all the distributions because of computational problems related with the zero adjusted routine 44 .However, considering that we use a large number of distributions, it should not be a great loss to exclude these models from the analysis.Once the regression models are fitted, we rank them via the Akaike information criterion (AIC 45 ) and by the Bayesian information criterion (BIC 46 ), which represent the most popular criteria in the actuarial literature 16,18,27,28 .
Secondly, as concerns the objective of assessing the impact of Gender on Loss, controlling for other covariates, we always use the GAMLSS regression framework to model the whole distribution and its tail.The research question in this case pertains to whether female claimants generate higher losses for insurers such that the application of higher rates can be supported by a "fair justification" 11 .The use of heavy-tailed distributions overcomes the problem of extreme values in actuarial datasets.Nonetheless, knowing how gender impacts the (1) www.nature.com/scientificreports/mean or one of the other parameters of the losses distribution is less interesting for insurers than knowing the impact of gender on the tail of the distribution, where the highest losses are placed.To study this portion of the data, without recurring to nonparametric methods like the less reliable quantile regression 34 or more complex approaches like entropic/symbolic methods 47 , we use a procedure that can be found in "Regression results" of the present paper 34,48 .

Comparing the tail behaviour
Comparing the female and male distributions in their tails is important information for insurers because of its relation to VaR.In detail, we define a parametric (model-based) bootstrap test that can be schematized as follows.
1. Compute the sample values at risk, VaR F α and VaR M α , separately for females and males, but at the same probability level α , and compute the test statistic , depending on the available data-to the whole data of size n = n F + n M , where n F and n M are the sample sizes for females and males, respectively.3.For r = 1, . . ., B : (a) generate two samples of sizes n F and n M from the model fitted at step 2; (b) compute the AD statistic, say AD r , on the generated samples.
4. Under H 0 (VaRs for males and females are statistically non-different), AD 1 , . . ., AD B are equally likely and the p value of the testing procedure can be computed as where F Boot (•) is the (stepwise) cumulative distribution function of AD 1 , . . ., AD B 49 .In real data analyses, whose results are described in "Distribution fitting results"-"Regression results", we consider a sufficiently large number of bootstrap replicates ( B = 1000 ); moreover, as usual in the insurance practice/ literature, we consider the probability levels 0.95 and 0.99.

AutoBi data
We start with the AutoBi data described in "The automobile bodily injury claims (AutoBi) dataset".Supplementary figures C.1-C.3 in Appendix C (online) show histograms and normal Q-Q plots for the total amount of losses (Supplementary figure C.1), for the losses reported by female claimants (Supplementary figure C.2), and for the losses reported by male claimants (Supplementary figure C.3).On the histograms we superimpose also a kernel density estimate (the red line) to give an idea on how the density of the observed data should look like.The horizontal axis of the histograms in Supplementary figures C.1-C.2 is restricted to 250 for the sake of readability.
From the Q-Q plots we see that the distribution of losses for both females and males cannot be approximated by a Gaussian distribution (which is quite obvious); furthermore, the underlying distributions appear to be right skewed and heavy-tailed, as we expected.From all the histograms we confirm another recurrent feature of insurance loss data: the presence of a large amount of small losses and a lower number of high losses 16,18 .However, it should be noted that the maximum loss is registered for female claimants (1067.70),whereas the maximum for male claimants is much smaller (222.41).The kernel density estimate in the three cases seems to suggest a similar distribution, highly right-skewed and highly peaked.Further detailed information on the differences among the data can be obtained looking at the descriptive statistics in Table 1.The mean loss is higher for females than males; however, looking at the median (less sensitive to extreme values) we see that there are no remarkable differences.Nonetheless, the variability (and then the risk) is much higher for females, as evidenced by the range and by the standard deviation.The females data are also more skewed and exhibit a more pronounced leptokurtosis.
The VaR shows that an insurer should expect (with confidence at 99%) higher losses for male policy holders.Supplementary Tables A.1-A.3 in Appendix A (online) show the results of the distribution fitting.The results can be summarized as follows.First, we see that among the best models we have the Box-Cox t (selected by both the AIC and BIC as the best model for the total losses and females' losses), the Truncated t and the Truncated Skew-t.Similar results are obtained for the female and male claimants, with a good performance of the Log-Johnson's SU model, whereas also the Generalized Pareto and the Log-Power Exponential are competitive models.Second, we do not observe drastic differences in the selection of models for females and males.Finally, we see that distributions often neglected in applied works, such as the generalized Pareto or the log-Johnson's SU, represent good alternatives to traditional models, whereas the variants of the normal distribution perform poorly for these data.
In order to check whether gender may explain differences in the loss distribution, we ran a GAMLSS regression for each model as described in the first part of "How the research objectives of the paper are handled".The results are reported in Table 2.The coefficient of gender was significant only for few distributions parameters and for an exiguous number of distributions.This is a strong evidence against the fact that the loss distribution is affected by gender, regardless of the considered parametric model.Supplemntary tables A.4-A.6 in Appendix A (online) show the VaR at 95% and 99% (computed numerically) for the three typologies of data for each of the selected models.We compared these results with the observed VaRs.In this case the ranking is very different because is based on the fact that the best distribution is the one that minimises the absolute distance from the empirical VaR.Summarily, we note that the results are very different if we consider a different confidence level.Furthermore, the results for the males in this case seem to differ from the results for the females.This is reasonable since extreme values are placed in the tail of the distribution.To test if these tail differences are statistically significant, we performed the parametric bootstrap test illustrated in "Comparing the tail behaviour"; the results are reported in the left part of Table 3.For many models the differences resulted statistically significant; therefore, we should conclude that for these models the tail behaviour differs by gender.This does not necessarily imply that female claimants are riskier than male claimants, it simply means that VaRs are different.

ausprivauto0405 data
We now analyse the distribution fitting results for the ausprivauto0405 data.Supplementary figures C. 4 male claimants (Supplementary figure C.6).We remember that for scaling purposes the variable ClaimAmount is expressed in hundreds of dollars; furthermore, since we are considering only reported losses, we have excluded for the moment the zeros.In this case there was no need to restrict the horizontal axis of the histograms.The analysis of the histograms and of the normal Q-Q plots confirm the findings observed in the first dataset and characterising the majority of claims data: non-normality deriving from severe right skewness and heavy-tailed distributions, and the fact that the majority of the observations are concentrated in the first bins of the histograms.The analysis of the plots including also the zeros is redundant.Table 4 shows the descriptive statistics for the ausprivauto0405 data (zeros excluded), whereas Table 5 shows the same statistics including also the zeros.We note that with respect to the other dataset, the losses for males are higher, on average and median, and more variable than the females.The females' loss distribution is slightly more peaked but less skewed, whereas the males' distribution including also the zeros shows higher kurtosis and skeweness.The VaR shows that an insurer should expect (with confidence at 99%) higher losses for male policy holders.
Supplementary Tables B.1-B.3 in Appendix B (online) show the results of the distribution fitting.The ZA Generalized Gamma was selected as the best model by both the AIC and BIC for the total claims, and both the females and males claims.The ZA Log-Skew Normal, the ZA Log-Johnson's SU and the ZA Generalized Inverse Gaussian were competitive models for all the three groups of data.Table 6 shows that, for this dataset, gender seems to play a role in explaining differences in the location parameter, and for some distributions also the dispersion parameter.As for the AutoBi data there is weak evidence that gender could explain the shape of the distribution.Supplementary Tables B. 4-B.6 in Appendix B (online) show the estimated VaR values at 95% and 99% using the ZA parametric models.We can say that ZA Truncated Power Exponential, ZA Generalized Pareto and ZA Log-Skew Normal are good models to describe the tail behaviour of these data.As in the previous dataset there are differences between the ranks obtained using the two different levels.However, in this case the VaR bootstrap www.nature.com/scientificreports/tests highlight that there are no significant differences in the tail of the distribution of male and female claimants when we consider a level of 95%, whereas significant differences emerge for a level of 99% (see Table 3).

Regression results
In this section we tackle the second research question of the paper, i.e. whether gender affects the claims distribution controlling for other available covariates.We fit regression models on the whole dataset and on the right tail of the data.The former approach is useful to quantify the effect of gender on the conditional location, scale and shape of the losses, the latter to quantify the effect of gender on the largest claims.For insurance companies this information is of relevant importance because it influences the solvency of the company and its policies.The GAMLSS framework consents to exploit the results of the distribution fitting in order to use the best model as underlying distribution.The choice of functions g i (•) , i = 1, . . ., 4 , to model the parameters of the considered models (refer to "The GAMLSS regression framework") is limited to those available in the gamlss package.To model the tail of the data we used a different approach 34,48 .These are synthetically the steps followed.
1. We fitted a α (95% and 99%) smooth quantile curve for LOSS (or ClaimAmount) against the explanatory variables using the R package cobs with automatic smoothing parameter selection.2. We selected the cases above the α quantile curve to work only with the tail of data.3. We fitted a suitable GAMLSS truncated distribution to the tail data with the fitted α quantile as truncation parameter.Since fitting via regression all the distributions is computationally prohibitive, the choice of an adequate distribution is determined using the best models obtained in "Distribution fitting results".For the whole dataset we used the best model on the total claims distribution, while for the tail of data we used the www.nature.com/scientificreports/best model as suggested by the VaR difference between the empirical VaR and the theoretical VaR.For the asprivauto0405 dataset we used GAMLSS zero-adjusted distributions.4. We fitted regression models to assess the magnitude of the gender coefficient on the distribution of claims using, for the tail of data, the truncated distribution as determined in step 3.

AutoBi
The AutoBi dataset contains the following explanatory variables: • Attorney: whether the claimant is represented by an attorney.
• Clminsur: whether or not the driver of the claimant's vehicle was uninsured.
• Seabelt: whether or not the claimant was wearing a seatbelt.
As before, the dependent variable of the regression model is Loss, the claimant's total economic loss (in thousands of dollars).In order to perform the regression model, we create dummy variables for Attorney (1 if yes), for Clmsex (1 if female), for each marital status, for Clminsur (1 if yes) and for Seatbelt (1 if yes).To avoid the dummy variables trap we exclude from the regression the dummy relative to divorced/separated, which becomes the benchmark category.Due to the presence of missing observations we use listwise deletion to eliminate the rows with missing information, therefore, the final dimension of the dataset in terms of rows is 1091.Tables 8 and 9 show the result of the GAMLSS regressions.We could not fit the best model for the 99% quantile because the cases above it are too few to fit a suitable regression model.Figure 1 shows the wormplots for the AutoBi data.We used also other graphical tools for diagnostics and we estimated many models but we omit them from this paper for the sake of synthesis.The interested reader can contact the corresponding author for further elaborations.

AutoBi: regression model on total claims
In Table 7, we report the results of two regression models.In model I we model only the equation of the parameter µ using all the data and all the explanatory variables.The best model, as suggested by the analysis performed in "AutoBi data", is the Box-Cox t distribution.The coefficient of our interest is the coefficient of Clmsex.Female claimants are associated with a positive and significant (at 5%) increase in the insurer losses (in thousands of dollars).The fit of the model is good enough as evidenced by the wormplot of the model in Fig. 1 (upper-left panel).However, we can obtain better estimates if we model also the other parameters, i.e. the scale parameter σ and the skeweness and kurtosis parameters ν and τ .To achieve this purpose, we gone through several models estimation.These models do not exhaust all the possible cases: given the fact that we can model four equations Table 6.ausprivauto0405: simple regression on gender for all the parameters of the considered models.For some distribution it was not possible to run the regression.Standard errors are given in parentheses.The estimate for ξ 0 is − 0.0165 (s.e.= 0.0308) for all models.*Stands for 5% significance.www.nature.com/scientificreports/using several explanatory variables, the number of cases is high.This happens because not only can we create many models by simply changing the set of explanatory variables among those available (all models with one variable, with two variables, with three variables, and so on) but we can test these different combinations in four different equations (one for the mean, one for the dispersion parameter, and so on).However, we tried to cover all the relevant cases for the research question of this paper.These relevant cases are all those in which it was possible to retain the gender variable (given the research question of this paper), and were considered the best (using information criteria and graphical tools such as wormplots) among those with the gender variable for which the algorithm was able to converge.
Model II represents the best model, with respect to the many models that we estimated, in terms of computational feasibility(with this term we refer to the fact that some models were not computationally feasible and/or showed excessive time complexity), AIC and BIC, and goodness of fit as exhibited by the worm plot.The wormplot (Fig. 1, upper-right panel) shows a better fit since all the points lie within the 95% confidence intervals given by the two elliptic curves.The coefficient of Clmsex preserved the same sign and approximately the same magnitude.On the other hand, Clmsex does not affect significantly the other parameters of the distribution.Finally, the significant coefficients of the other explanatory variables are economically reasonable.For example, considering the µ equation, if the claimant is represented by an attorney, the insurance company tends to pay bigger amounts; if the age of the claimant increases, also the loss for the company increases, probably because elder people suffer more physical damages in car accidents.

AutoBi: regression model on the tail of data
The analysis for the tail of the data is reported in Table 8.In this case the best distribution is selected according the result for the VaR estimation reported in online Supplementary table A.4.Once again, we first estimate a model (III) only for the µ equation and with all the explanatory variables (Widowed is dropped because on 54 cases there were not sufficient observations for this variable).The other model (IV) is again the best one in the sense specified in "AutoBi".In model IV we include a smoother for Clmsex (pb is a smoothing additive term www.nature.com/scientificreports/based on P-splines) for both the µ and σ equations.Modeling also the other equations is not possible due to the low number of cases available in the tail of data.These results are probably more interesting for an insurer.The coefficient of Clmsex is strongly significant and negative in both models.This means that female claimants entail lower losses for insurers, which means that the biggest losses are made for male claimants as confirmed by other works 9,10 .In model IV we also learn that the variable Clmsex has a negative effect also on the scale parameter, which means that female claimants decrease the spread in the tail of the distribution.Both the wormplots of model III and IV show a satisfactory fit (Fig. 1, respectively, lower-left and lower-right panels).Once again, the presence of an attorney is associated with the biggest losses for the company.

ausprivauto0405
The dataset asprivauto0405 contains 9 variables.The dependent variable in our study is ClaimAmount, which is the sum of claim payments.In this case we do not use the term loss because the variable ClaimAmount contains also zeros.The explanatory variables available in the dataset are: • Exposure: the number of policy years.
• VehValue: the vehicle value in thousand of Australian dollars.
• VehAge: The vehicle age group divided into 4 classes: old cars, oldest cars, young cars and youngest cars.We created a dummy variable for each category.• VehBody: the vehicle body group divided into 13 classes: Bus, Convertible, Coupe, Hardtop, Hatchback, Minibus, Motorized caravan, Panel van, Roadster, Sedan, Station wagon, Truck and Utility.We created a dummy variable for each category.• Gender: the gender of the policyholder.We created a dummy variable for female claimants (Female).

ausprivauto0405: regression model on the tail of data
We shift now our attention to the tail of the distribution.Since now we deal with data above the 95% and 99% quantiles, we are eliminating from the analysis all the zeros and dealing only with losses.In this case the regression framework becomes again the traditional GAMLSS without any need for zero-adjustment.Moreover, including the variable ClaimOcc becomes redundant because in the tail there are only realised claims.
Table 10 shows the results of the best model for cases above the 95% quantile among many competing models.The choice of the Truncated Power Exponential was determined by the results obtained comparing the empirical VaR with the VaR predicted by the models (online Supplementary table B.4, Appendix B).One may notice that the analysis of VaR was conducted using ZA distributions, but this is a minor concern since the wormplot shows that the model offers a good fit for the data (Fig. 2, upper-right panel).The coefficient of Female is significant and positive in the µ equation, which means that claims in the tail increase for female claimants, whereas the coefficient of Female for the scale parameter is non-significant.We excluded the variable from the ν equation because it was non-significant and it affected severely the goodness of fit of the model.
Table 11 shows two possible models to describe the behaviour of extreme losses.Both models are good in terms of fit as highlighted by the wormplots in Fig. 2.However, model VII should be preferred in terms of AIC and BIC.In model VIII the variable Female was removed from the equation for the location parameter because it was non-significant.The choice of the underlying distributions is again determined by computational feasibility and the results of Supplementary table B.4 (Appendix B, online).The coefficient of the variable Female is negative and significant at 10% for the location parameter in model VII and for the dispersion parameter in model VIII.These results are in line with the observed tail behaviour in the AutoBi dataset (Table 8).

Potential limitations
In this section, we address a series of shortcomings that could undermine the validity of our results.

Dataset
Finding adequate data when dealing with actuarial studies is a relevant problem.Since in most cases researchers need micro-data, these data should contain enough information, especially when one aims to run regressions.In our case a suitable dataset must report the claimant's gender and a sufficient number of other variables to avoid endogeneity problems.Furthermore, the ideal dataset should include an high number of observations and should contain data on a relevant geographical context to draw useful policy proposals.Nonetheless, the search of these data was not painless.We think that the data used in our study are a good compromise.The AutoBi dataset allows us to study the American context, where the problem of pricing based on gender is currently relevant.Moreover, the ausprivauto0405 dataset allows us to extend the analysis to a different geographical context, including also policy holders with no claims.One may argue that the data used are old.We think that this is not a serious problem for many reasons.It is customary in actuarial studies to work with important and established datasets.Working with reliable and significant data is more important than working with new data.Furthermore, as already mentioned, finding data is very difficult.The literature is plenty of works dealing with older but established datasets.Just to mention: the Danish Fire losses dataset contains data gathered over the period 1980-1990, yet it is still one of the most used in contemporary studies 18,27 ; Fuzi et al. 33 used private car policies in year 2001; Blostein and Miljkovic 28 used data for the time period 1988-2001.Another relevant aspect to consider is that the distribution of claims generally presents the same statistical features over time and across countries.
We are aware of the fact that many other variables should have been added in the model, such as locations of accident, time of the accident, reason of the accident (drug, traffic rule disregard, etc.) and so on.Nonetheless, a dataset with such a detailed information, to our knowledge, is not freely accessible.The data used in this paper are among the most complete we could have found.Nevertheless, we must stress that the use of country-specific data limits the conclusions drawn from these datasets to the cases analyzed; therefore, further research using the same methodology but different data would help corroborate the results of this work.In this regard, the hope is that more insurers will make the data freely available to advance actuarial research.

Causality
The regression models used in our analysis served to study the relationship between gender and claims; however, no causal effect can be drawn from this setup.The point is that even conceiving a study capable of assessing the existence of a causal effect is troublesome because car accidents, and hence the amount of claims, are too complex to ideate any experiment.The lack of data makes this problem even worse.Nonetheless, the study of correlations is important to investigate whether a fair justification supporting a pricing practice exists.

External validity
One major drawback from using data of US and Australian companies is the impossibility of drawing general conclusions also for other countries.In general, a representative sample is needed to generalize the results to different countries.As one of the referees pointed out, it is reasonable to assume that our data are not representative of the many policy holders who have contracts with insurance companies.This obviously limits the application of our results to the scenarios analyzed, and their application to broader contexts depends strictly on how close one thinks our data are to a representative sample.Despite this, our results are useful for different reasons.First, as we point out in "Introduction", the problem of price discrimination based on gender is particularly relevant in the US.This work therefore can be used to provide statistical substance to the debate.Second, Australia and USA are two prominent markets for insurers worldwide.Third, even though driving habits are very different from country to country, countries with similar backgrounds can still use the results of our analysis.Fourth, the loss distribution is characterized by stylized facts that make the present study useful also for different data.Finally, our work can serve as a stimulus to produce further empirical evidence on this topic, providing new insights into the external validity of our results.

Conclusions and policy implications
This paper provides several results that extend and enrich the existing literature.These results can be split into two parts.In the first part of the paper, we focus our attention on finding the best statistical model to describe the distribution of claims.The variables investigated are taken from two important R packages.The Autobi dataset allows us to work on losses, as is commonly done in the literature 16,18,27 , whereas the ausprivauto0405 includes also zeros, allowing us to adopt the zero-adjusted distribution framework.Moreover, we conduct the analysis not only on total claims but also distinguishing by gender and analysing the tail behaviour of the data. of a negative effect on location when considering all the data and on extreme losses (cases above 99%), and a positive effect when considering cases above 95%.The negative effect on the location parameter on the whole dataset is, in our opinion, a more reliable result than the positive effect for the AutoBi dataset because the inclusion of zeros accounts for the fact that females can be safer policy holders.
Nonetheless, the regression framework presents some limits.The principal limits are related to the high complexity of the computational routines and to the lack of data.We must rely on the adequacy of the control variables provided in the R packages.The strength of the empirical analysis is that the GAMLSS framework allowed us to study the phenomenon thoroughly, including also equations for the other parameters of the distribution (quite often neglected in empirical works) and weighting also the information carried by the zeros.The main limitation is the use of old, country-specific data, which reduces the scope of these results, although the analysis is robust and allows useful policy implications to be drawn for many countries.
In conclusion, our research enlightened that finding a "fair justification" 11 for applying different rates to male and female claimants is difficult.However, female claimants seem in most of the investigated cases to decrease the location parameter for extreme losses and when zeros are included.Furthermore, in our data female claimants have a beneficial effect on the scale parameter of claims, since for females the spread of losses decreases.We do not think that these results represent incontestable statistical reasons to differentiate policy rates by gender.Indeed, if we read our results together with other works that show that female policy holders are safer than men, we do not see any clear reason to charge women with higher rates.The same argument can be made for male policy holders.The evidence collected suggests in part that men may be riskier for insurance companies in some cases, but the evidence is not strong enough to justify charging higher rates.Future research can make use of the methodology presented in this paper to see if similar results are obtained for different data.In any case, this paper offers guidance to policy makers in the countries considered on whether unisex pricing policies should be promoted. https://doi.org/10.1038/s41598-024-52959-8

Figure 2 .
Figure 2. Wormplots of model V-VIII (Tables 9, 10, 11) for the ausprivauto0405 dataset.Upper panels: model V on the left, model VI on the right.Lower panels: model VII on the left, model VIII on the right.

Table 1 .
Automobile bodily injury claims dataset: descriptive statistics of loss data.

Table 2 .
-C.6 in Appendix C (online) show histograms and normal Q-Q plots for the total amount of losses (Supplementary figure C.4), for the losses reported by female claimants (Supplementary figure C.5) and for the losses reported by AutoBi: simple regression on gender for all the parameters of the considered models.For some distribution it was not possible to run the regression.Standard errors are given in parentheses.*Stands for 5% significance.

Table 3 .
p-values of the parametric bootstrap tests for the hypothesis that the predicted VaRs by the models for males and females are statistically non-different.95% and 99% levels are considered.ZA stands for zeroadjusted.

Table 4 .
ausprivauto0405: descriptive statistics of loss data excluding the zeros.

Table 5 .
ausprivauto0405: descriptive statistics of claims data including also the zeros.

Table 7 .
Results of the GAMLSS regression on the AutoBi data.Standard errors are given in parentheses.Model I and II assume as underlying distribution the best distribution for total claims (online Supplementary table A.1, Appendix A) as univocally determined by AIC and BIC.*indicates 10% significance, **indicates 5% significance, ***indicates 1% significance.

Table 8 .
Results of the GAMLSS regression on the AutoBi dataset for the tail of data (cases above 95% quantile).Standard errors are given in parentheses.Model III and IV assume as underlying distribution the best distribution based on the difference between the empirical VaR and the distribution-based VaR for a 95% confidence level for total claims (online Supplementary table A.4, Appendix A). *indicates 10% significance, **indicates 5% significance, ***indicates 1% significance.

Table 9 .
Results of the GAMLSS regression on the ausprivauto0405 dataset.Standard errors are given in parentheses.The model assumes as underlying distribution the ZA Log-Johnson's SU, which is the fourth best model for total claims (online Supplementary table B.1, Appendix B) as univocally determined by AIC and BIC.*indicates 10% significance, **indicates 5% significance, ***indicates 1% significance.

Table 10 .
Results of the GAMLSS regression on the ausprivauto0405 dataset for the tail of data (cases above 95% quantile).Standard errors are given in parentheses.The model assumes as underlying distribution the best distribution (Truncated Power Exponential) based on the difference between the empirical VaR and the distribution-based VaR for a 95% confidence level for total claims (online Supplementary table B.4, Appendix B). *indicates 10% significance, **indicates 5% significance, ***indicates 1% significance.